ABSTRACT

This paper studies the estimation of coefficients β in single index models
such that E(y|X) = F(α + X'β), where the function F is misspecified or unknown. A
general connection between behavioral derivatives and covariance estimators is
established, which shows how β can be estimated up to scale using information
on the marginal distribution of X. A sample covariance estimator and an
instrumental variables slope coefficient vector are proposed, which are
constructed using appropriately defined score vectors of the X distribution.
The framework is illustrated using several common limited dependent variable
models, and extended to multiple index models, including models of selection
bias and multinomial discrete choice. The asymptotic bias in the OLS
coefficients of y regressed on X is analyzed. The asymptotic distribution of
the instrumental variables estimator is established, when the X distribution
is modeled up to a finite parameterization.
CONSISTENT ESTIMATION OF SCALED COEFFICIENTS

1. Introduction

This paper considers the generic econometric modeling situation in which
a dependent variable y is modeled as a function of a vector of explanatory
variables X and stochastic terms, where the conditional expectation of y given
X can be written in the single index form E(y|X) = F(α + X'β). This situation
exists for many standard models of discrete choice, censoring and selection,
but is clearly not limited to such models. The question of interest is what
can be learned about the coefficients β without specific assumptions on the
distribution of unobserved stochastic terms or other functional form aspects;
in other words, when the true form of the function F is misspecified or
unknown.

For different examples of limited dependent variables models, several
researchers have studied the conditions under which ordinary least squares
(OLS) regression coefficients and other quasi-maximum likelihood estimators
will consistently estimate β up to a scalar multiple. Ruud (1983) points out
that a sufficient condition for this property occurs when the conditional
expectation of each component of X given Z = α + X'β is linear in Z, which is
valid when X is multivariate normally distributed, for instance. Chung and
Goldberger (1984) and Deaton and Irish (1984) point out the sufficiency of an
analogous condition with a more general definition of Z.²

An intriguing feature of this work is that it provides special cases
where knowledge of the marginal distribution of X is very useful for
estimating behavioral effects when certain features of the true model are
unknown. The question is immediately raised as to whether more general results
of this type can be found, because as Ruud (1983) states, the above sufficient
conditions are "too restrictive to be generally applicable." Results that
apply to more general marginal distribution forms are of substantial practical
interest because, in general, the marginal distribution of X can be
empirically characterized. In this spirit, Ruud (1984) has proposed an
estimation technique based on reweighting the data sample so that the weighted
X distribution is multivariate normal.

This paper proposes an approach for studying β based on estimation of
average behavioral derivatives, and shows how information on the marginal
distribution of X can be used to estimate average derivatives. In particular,
a direct link between average derivatives and covariance estimators is
established, which shows how β can be estimated up to a scalar multiple by the
sample covariance between y and appropriately defined score vectors of the
marginal distribution of X. β is also consistently estimated up to scale by
the slope coefficients of the linear equation of y regressed on X using the
score vectors as instrumental variables.

The ratio of any two components of either the sample covariance or the
instrumental variables coefficient vector will consistently estimate the ratio
of the corresponding components of β. These estimates may suffice for many
applications, such as judging relative marginal utilities in a discrete choice
situation. More broadly, the ratio estimates provide a consistent benchmark
for choosing specific modeling assumptions. For instance, in a binary discrete
choice situation, separate estimates of β under logit or probit modeling
assumptions can be judged in relation to the consistent ratio estimates. This
method of assessing specification may be useful in any modeling situation
where alternative functional form or stochastic assumptions give rise to
substantively different estimates of β.

The exposition begins with notation, examples and formal assumptions in
Section 2. Section 3 establishes the connection between average derivatives
and covariance estimators. Section 4 applies the result to the estimation of β
up to scale, as well as to estimation of parameters of more general multiple
index models. Section 5 studies the asymptotic bias of the vector of OLS
coefficients of y regressed on X, as an estimator of the true average
derivative, and as an estimator of β up to scale. Section 6 establishes the
asymptotic distribution of the proposed instrumental variables estimator, when
the distribution of X is modeled up to a finite parameterization. Section 7
contains concluding remarks and topics for future research.

2. Notation, Examples and Basic Assumptions

Consider the situation where data is observed on a dependent variable y_i
and an M-vector of explanatory variables X_i for i = 1,...,N, where M ≥ 2.
(y_i, X_i), i = 1,...,N represent random drawings from a distribution T which is
absolutely continuous with respect to a σ-finite measure ν, with Radon-Nikodym
density P(y,X) = ∂T/∂ν. P(y,X) factors as P(y,X) = q(y|X)p(X), where p(X) is the
density of the marginal distribution of X. The conditional density q(y|X)
represents the true behavioral econometric model, which we assume permits the
conditional expectation E(y|X) to be written in the form

(2.1)      E(y \mid X) = F(\alpha + X'\beta) = F(Z)

for some function F, where α is a constant, β = (β_1,...,β_M)' is an M-vector of
constants, and Z is defined as Z = α + X'β. I refer to Z as an index variable,
with (2.1) a single index model.

This framework is very general, subsuming many limited dependent
variables models, but is not restricted to such models. Before proceeding to
specific examples, it is useful to note a generic special case of (2.1).
Suppose that Z* is a general index variable such that ε = Z* − Z is independent
of X; then if E(y|X,ε) = F*(Z*) for some function F*, (2.1) is implied. This
includes many models that employ a latent variable Z* = α + X'β + ε, where ε is
independent of X. Note also that this implies that behavioral variables can be
omitted from X without affecting the results, provided that the omitted
variables are independent of the included ones.³

I now turn to some specific examples:

Example 1: Binary Discrete Choice

Suppose that y represents a dichotomous random variable modeled as

y = 1      if ε > −(α + X'β)
  = 0      otherwise

Here E(y|X) = F(α + X'β) is the probability of y = 1 given the value of X, with the
true function F determined by the distribution of ε. If ε is distributed
normally with mean 0 and variance σ², then the familiar probit model results,
with F(α + X'β) = Φ((α + X'β)/σ), where Φ is the cumulative normal distribution
function. Logit models, etc., can easily be included.

Example 2: Tobit Models

Suppose that y is equal to an index Z* only if Z* is positive, as in the
censored tobit specification

y = α + X'β + ε      if ε > −(α + X'β)
  = 0                otherwise

Alternatively, if y and X are observed only when ε > −(α + X'β), we have the
truncated tobit specification.


Example 3: Dependent Variable Transformations

Suppose there exists a function g(y) such that the true model is of the form

g(y) = α + X'β + ε

where g(y) is invertible everywhere except for a set of measure 0. A specific
example is the familiar Box-Cox transformation where

y^(λ) = α + X'β + ε

with y^(λ) = (y^λ − 1)/λ for λ ≠ 0, and y^(λ) = ln(y) for λ = 0.

These examples serve to illustrate the wide spectrum of models covered by the
single index form (2.1) with general function F, and many other examples can
be found. Multiple index models are considered in Section 4.2.
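As a minimal sketch of how Examples 1 and 2 arise from a common index, the following Python fragment simulates both from the same draws. The normal distributions for X and ε, the parameter values, and all function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def draw_index(n, alpha, beta, rng):
    """Draw X (standard normal here, an assumption) and the index Z = alpha + X'beta."""
    X = rng.standard_normal((n, len(beta)))
    return X, alpha + X @ beta

def binary_choice(z, eps):
    """Example 1: y = 1 if eps > -Z, and 0 otherwise."""
    return (eps > -z).astype(float)

def censored_tobit(z, eps):
    """Example 2: y = Z + eps if eps > -Z, and 0 otherwise."""
    return np.where(eps > -z, z + eps, 0.0)

rng = np.random.default_rng(0)
beta = np.array([1.0, 0.5])          # hypothetical coefficient values
X, z = draw_index(10_000, alpha=-0.5, beta=beta, rng=rng)
eps = rng.standard_normal(10_000)    # hypothetical disturbance distribution
y_binary = binary_choice(z, eps)
y_tobit = censored_tobit(z, eps)
```

In both cases the dependence of y on X runs only through Z, which is the property that (2.1) formalizes.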

We now turn to the other required assumptions. X is assumed to be
continuously distributed, having support Ω of the following form:⁴

Assumption 1: Ω is a convex subset of R^M with nonempty interior. The
underlying measure ν can be written in product form as ν = ν_y × ν_X, where ν_X is
Lebesgue measure on R^M.

Therefore, no component of X is functionally determined by other components of
X, and no two components of X are perfectly correlated.

Denote ℓ(X) as the score vector of the marginal density p(X):⁵

(2.2)      \ell(X) = -\frac{\partial \ln p(X)}{\partial X}
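To make definition (2.2) concrete, two closed-form cases may help. These are standard calculations offered as illustrations; the specific distributions are assumptions and are not singled out in the paper.

```latex
% Multivariate normal X ~ N(\mu, \Sigma_{XX}):
\ln p(X) = \text{const} - \tfrac{1}{2}(X-\mu)'\Sigma_{XX}^{-1}(X-\mu)
\quad\Longrightarrow\quad
\ell(X) = \Sigma_{XX}^{-1}(X-\mu).

% Scalar standard logistic X, with p(x) = e^{-x}/(1+e^{-x})^2:
\ell(x) = -\frac{d}{dx}\Bigl[-x - 2\ln\bigl(1+e^{-x}\bigr)\Bigr]
        = 1 - \frac{2e^{-x}}{1+e^{-x}}
        = \tanh(x/2).
```

In both cases the score has mean zero, and the logistic score is bounded.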

The main regularity conditions on the marginal density p(X) are

Assumption 2: p(X) is continuously differentiable in the components of X for
all X in the interior of Ω. E(ℓ) and E(ℓℓ') exist.

Assumption 3: For X ∈ ∂Ω, where ∂Ω is the boundary of Ω, we have p(X) = 0.

Assumption 3 allows for unbounded X's, where Ω = R^M and ∂Ω = ∅. While the
majority of the results employ Assumptions 2 and 3, the incorporation of
discrete (qualitative) explanatory variables is discussed in Section 4.2.

I will make reference to the following set of regularity conditions on a
general random variable y and its conditional expectation E(y|X) = G(X).
(y,G(X)) satisfies condition A if

Condition A: G(X) is continuously differentiable for all X ∈ Ω̄, where Ω̄ differs
from Ω by a set of measure 0. E(y), E(∂G/∂X) and E(ℓy) exist.

The main regularity condition on the behavioral model (2.1) is contained in

Assumption 4: a) (y, F(α + X'β)) satisfies condition A. E(dF/dZ) is nonzero.
              b) (X_j, X_j) satisfies condition A for each j = 1,...,M.

This completes the list of main assumptions. While somewhat formidable
technically, these assumptions are collectively very weak.

The main thrust of the paper concerns how information on the marginal
density p(X) can be used to estimate β up to scale. Consequently, the majority
of the exposition assumes that the value of ℓ(X) at each X_i is known, and
denoted ℓ_i = ℓ(X_i), i = 1,...,N. Use of empirical characterizations of p(X) is
discussed in Section 6.

Finally, sample averages are denoted via overbars as in ȳ = Σ y_i/N, with
the means of y and X denoted as μ_y = E(y) and μ_X = E(X). Sample covariances are
denoted using S as in S_ℓy = Σ(ℓ_i − ℓ̄)(y_i − ȳ)/N, with population counterparts
denoted using Σ as in Σ_ℓy = Cov(ℓ,y).


3. Behavioral Derivatives and Covariance Estimators

This section presents a fundamental connection between behavioral
derivatives and covariance estimators, which is the basis of the consistency
results of Section 4. The connection is given in the following theorem, which
is interpreted after the proof.

Theorem 1: Given Assumptions 1-3, if (y,G(X)) satisfies condition A, then

(3.1)      E\left[\frac{\partial G}{\partial X}\right] = E(\ell(X)\,y) = \Sigma_{\ell y}


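Proof: Let X_1 denote the first component of X, and apply Fubini's Theorem
(c.f. Billingsley (1979), among others) to write E(∂G/∂X_1) as

(3.2)      E\left[\frac{\partial G}{\partial X_1}\right]
             = \int \frac{\partial G(X)}{\partial X_1}\, p(X)\, d\nu
             = \int \left[ \int_{\omega(X_0)} \frac{\partial G(X)}{\partial X_1}\, p(X)\, d\nu_1(X_1) \right] d\nu_0(X_0)

where X_0 represents the other components of X, and ω(X_0) is the set of X_1
values for which (X_1,X_0) lies in Ω. The result that
E(∂G/∂X_1) = E(ℓ_1(X)y) is implied by the validity of the following equation

(3.3)      \int_{\omega(X_0)} \frac{\partial G(X)}{\partial X_1}\, p(X)\, d\nu_1(X_1)
             = -\int_{\omega(X_0)} G(X)\, \frac{\partial p(X)}{\partial X_1}\, d\nu_1(X_1)

since the RHS of (3.3) simplifies to

(3.4)      -\int_{\omega(X_0)} G(X)\, \frac{\partial p(X)}{\partial X_1}\, d\nu_1(X_1)
             = \int_{\omega(X_0)} G(X) \left[ -\frac{\partial \ln p(X)}{\partial X_1} \right] p(X)\, d\nu_1(X_1)

By inserting (3.3) and (3.4) into (3.2), E(∂G(X)/∂X_1) = E(ℓ_1(X)G(X)) is
established, and by iterated expectation, E(ℓ_1(X)G(X)) = E(ℓ_1(X)y).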

To establish (3.3), note first that the convexity of Ω implies that ω(X_0)
is either a finite interval [a,b] (where a, b depend on X_0), or an infinite
interval of the form [a,∞), (−∞,b] or (−∞,∞). Supposing first that
ω(X_0) = [a,b], integrate the LHS of (3.3) by parts (c.f. Billingsley (1979)) as

(3.5)      \int_a^b \frac{\partial G(X)}{\partial X_1}\, p(X)\, d\nu_1(X_1)
             = -\int_a^b G(X)\, \frac{\partial p(X)}{\partial X_1}\, d\nu_1(X_1)
               + G(b,X_0)\,p(b,X_0) - G(a,X_0)\,p(a,X_0)

The latter two terms represent G(X)p(X) evaluated at boundary points, which
vanish by Assumption 3, so that (3.3) is established for ω(X_0) = [a,b].

For the unbounded case ω(X_0) = [a,∞), note first that the existence of
E(y), E(∂G/∂X_1) and E(ℓ_1(X)y) respectively imply the existence of E(G(X)|X_0),
E(∂G/∂X_1|X_0) and E(ℓ_1(X)G(X)|X_0) (c.f. Kolmogorov (1950)). Now consider the
limit of (3.5) over intervals [a,b], where b→∞, rewritten as

(3.6)      \lim_{b\to\infty} G(b,X_0)\,p(b,X_0)
             = G(a,X_0)\,p(a,X_0)
               + \lim_{b\to\infty} \int_a^b \frac{\partial G(X)}{\partial X_1}\, p(X)\, d\nu_1(X_1)
               + \lim_{b\to\infty} \int_a^b G(X)\, \frac{\partial p(X)}{\partial X_1}\, d\nu_1(X_1)
             = G(a,X_0)\,p(a,X_0)
               + p_0(X_0)\, E\left[\frac{\partial G}{\partial X_1} \,\middle|\, X_0\right]
               - p_0(X_0)\, E\left(\ell_1(X)\,G(X) \mid X_0\right)

so that C = lim G(b,X_0)p(b,X_0) exists, where p_0(X_0) is the marginal density of
X_0. Now suppose that C > 0. Then there exist scalars ε and B such that 0 < ε < C
and for all b ≥ B, |G(b,X_0)p(b,X_0) − C| < ε. Therefore G(X_1,X_0)p(X_1,X_0) >
(C − ε)I_[B,∞), where I_[B,∞) is the indicator function of [B,∞). But this implies
that p_0(X_0)E(G(X)|X_0) = ∫G(X_1,X_0)p(X_1,X_0)dν_1(X_1) > (C − ε)∫I_[B,∞)dν_1(X_1) = ∞,
which contradicts the existence of E(G(X)|X_0). Consequently, C > 0 is ruled out.
C < 0 similarly contradicts the existence of E(G(X)|X_0).

Since C = lim G(b,X_0)p(b,X_0) = 0, and G(a,X_0)p(a,X_0) = 0 by Assumption 3,
equation (3.3) is valid for ω(X_0) = [a,∞). Analogous arguments establish the
validity of (3.3) for ω(X_0) = (−∞,a] and ω(X_0) = (−∞,∞).

The second equality of (3.1), E(ℓ(X)y) = Cov(ℓ(X),y), is true because the
mean of ℓ(X) is 0. The proof is completed by repeating the same development
for derivatives of G(X) with respect to X_2,...,X_M. QED


Theorem 1 is of significant theoretical interest. It says that the
average behavioral derivative E(∂G/∂X) can be written as the covariance
between y and a function of X; namely ℓ(X). The form of ℓ(X) does not depend
on the behavioral relation E(y|X) = G(X): ℓ(X) is determined only by the
marginal density p(X). Thus, Theorem 1 establishes a general link between
behavioral derivatives and covariance estimators, that does not depend on
assumptions on the form of behavior.⁷ The proof is extremely simple, based on
integration-by-parts.
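Theorem 1 can be checked numerically in a case where the score is available in closed form. The sketch below assumes X is bivariate normal, so that the score is the inverse covariance matrix times (X − μ), and uses an arbitrary smooth G chosen purely for illustration. The function name, parameter values, and choice of G are all hypothetical.

```python
import numpy as np

def check_theorem1(n=400_000, seed=0):
    """Compare the sample average derivative of G with the sample covariance of (score, y)."""
    rng = np.random.default_rng(seed)
    mu = np.array([0.5, -1.0])
    cov_x = np.array([[1.0, 0.3], [0.3, 2.0]])
    X = rng.multivariate_normal(mu, cov_x, size=n)

    # Score of the normal density: l(X) = cov_x^{-1} (X - mu), one row per observation
    scores = (X - mu) @ np.linalg.inv(cov_x)

    # Illustrative behavioral relation E(y|X) = G(X), plus independent noise
    G = np.exp(0.3 * X[:, 0]) + X[:, 0] * X[:, 1]
    y = G + rng.standard_normal(n)

    # Analytic gradient of G, averaged over the sample
    grad = np.column_stack([0.3 * np.exp(0.3 * X[:, 0]) + X[:, 1], X[:, 0]])
    avg_derivative = grad.mean(axis=0)

    # Sample covariance between the score vector and y
    cov_score_y = (scores - scores.mean(axis=0)).T @ (y - y.mean()) / n
    return avg_derivative, cov_score_y

avg_derivative, cov_score_y = check_theorem1()
print("average derivative:", avg_derivative)
print("Cov(score, y):     ", cov_score_y)
```

The covariance calculation never uses the form of G, which is the point of the theorem: only the marginal density of X enters the score.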

A useful intuition for Theorem 1 can be obtained from its connection to
results in aggregation theory. In particular, Theorem 1 reflects the local
aggregate effects on E(y) of translating the base density p(X). To see this
connection, consider the unbounded case where Ω = R^M. Suppose that the base
density is translated by an M-vector θ; p(X) is altered to p(X−θ) for all X.
The value of E(y) after this translation is given as

(3.7a)     E(y \mid \theta) = \int G(X)\, p(X-\theta)\, d\nu

By a change-of-variables, E(y|θ) can also be written as

(3.7b)     E(y \mid \theta) = \int G(X+\theta)\, p(X)\, d\nu

The local aggregate effects of the translation are the derivatives
∂E(y|θ)/∂θ evaluated at θ = 0. Differentiating (3.7a) under the integral sign
and evaluating at θ = 0 gives

(3.8a)     \frac{\partial E(y \mid 0)}{\partial \theta}
             = \int G(X)\, \frac{\partial p}{\partial \theta}\, d\nu
             = \int G(X)\, \frac{\partial \ln p}{\partial \theta}\, p(X)\, d\nu
             = \int G(X)\, \ell(X)\, p(X)\, d\nu

where the latter equality reflects that ℓ(X) equals ∂ln p(X−θ)/∂θ evaluated at
θ = 0. Similarly, differentiating (3.7b) gives

(3.8b)     \frac{\partial E(y \mid 0)}{\partial \theta}
             = \int \frac{\partial G(X)}{\partial X}\, p(X)\, d\nu
             = E\left[\frac{\partial G}{\partial X}\right]

Collecting the equalities of (3.8a,b) gives E(G(X)ℓ(X)) = E(∂G/∂X), which
underlies equation (3.1) of Theorem 1.

Theorem 1 thus has a simple geometric explanation. For evaluating the
mean E(y) under translation, one can average G(X) over the distribution p(X)
shifted by θ (equation (3.7a)), or one can shift G(X) by −θ and average over
the distribution p(X) (equation (3.7b)). The local effects on E(y) can be
computed from either perspective (equations (3.8a,b)) to yield the same value.
Equation (3.1) just exhibits this equivalence.
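The two perspectives on translation can be compared directly in a one-dimensional sketch. It assumes a standard normal base density (so the score is simply x) and takes G = tanh as an arbitrary smooth example; both choices and the function name are illustrative assumptions.

```python
import numpy as np

def translation_effects(n=200_000, h=1e-3, seed=0):
    """Local effect of translating the density, computed two ways."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)      # base density N(0,1); its score is l(x) = x
    G = np.tanh                     # illustrative G, not from the paper

    # Perspective (3.7b): shift the argument of G and take a central difference at theta = 0
    shifted_function = (G(x + h).mean() - G(x - h).mean()) / (2.0 * h)

    # Perspective (3.8a): weight G by the score of the base density
    score_weighted = (G(x) * x).mean()
    return shifted_function, score_weighted

shifted_function, score_weighted = translation_effects()
print(shifted_function, score_weighted)
```

The first number approximates the average derivative of G, while the second uses only the values of G and the score; the two agree up to sampling error.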

4. Consistent Estimation of Scaled Coefficients

This section indicates how to estimate β up to scale for single index
models of the form (2.1). Section 4.1 indicates the basic approach and
proposes a covariance estimator and an instrumental variables estimator.
Section 4.2 discusses immediate extensions of the basic results and Section
4.3 gives some further remarks.

4.1 The Average Derivative Approach to Estimation

Begin by considering a precise empirical implication of the single index
model form E(y|X) = F(α + X'β). Clearly, the conditional mean of y depends on
X only through the value of X'β. By exploiting differentiability, a precise
restriction of the single index form is given as

(4.1)      \frac{\partial E(y \mid X)}{\partial X} = \frac{\partial F}{\partial X} = \left[\frac{dF}{dZ}\right] \beta

Thus, ∂E(y|X)/∂X is proportional to β, although the scale factor dF/dZ will
depend on the value of X chosen.

The basic approach in this paper is to focus on the average of the
constraint (4.1):

(4.2)      E\left[\frac{\partial F}{\partial X}\right] = E\left[\frac{dF}{dZ}\right] \beta = \gamma\beta

where γ = E(dF/dZ) exists and is nonzero by Assumption 4. Clearly, any
consistent estimate of the average derivative E(∂F/∂X) is a consistent
estimator of β up to scale.

Two natural consistent estimators are suggested by Theorem 1. First
define the estimator d_0 as the sample covariance between y_i and ℓ_i:

(4.3)      d_0 = S_{\ell y}

The second estimator is more closely related to standard regression
estimators, such as the OLS coefficients of y regressed on X. Define d as the
instrumental variables coefficients of the regression

(4.4)      y_i = c + X_i' d + u_i

obtained using (1, ℓ_i')' as the instrumental variable; namely

(4.5)      d = (S_{\ell X})^{-1} S_{\ell y}

The consistency of d_0 and d for γβ follows immediately from Theorem 1, as in


Theorem 2: Given Assumptions 1-4, d_0 and d are strongly consistent estimators
of γβ, where γ = E(dF/dZ).

Proof: The Strong Law of Large Numbers (c.f. Rao (1973), Section 2c.3, SLLN2)
implies that lim S_ℓy = Σ_ℓy a.s. Theorem 1 and (4.2) imply that lim d_0 = γβ a.s.
lim d = γβ a.s. follows if lim S_ℓX = Σ_ℓX = I, an M×M identity matrix. In view
of Assumption 4b), Theorem 1 can be applied with y = X_j, for each j = 1,...,M.
Carrying this out gives Σ_ℓX = I. QED

The two estimators d_0 and d appear very similar; however, in general they
are not first-order (√N) equivalent. In particular,

(4.6)      \sqrt{N}\,(d - d_0) = \sqrt{N}\,(S_{\ell X}^{-1} - I)\, d_0

Since lim d_0 = γβ ≠ 0, and √N(S_ℓX⁻¹ − I) in general has a nontrivial limiting
distribution, √N(d − d_0) will not vanish as N→∞. For expository purposes, I will
refer to d for the remainder of the exposition; however, all consistency
results can be extended to d_0.⁹
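The two estimators can be sketched in a few lines for a simulated binary choice model. The example assumes a deliberately non-normal X (one standard logistic component with score tanh(x/2), one gamma(4,1) component with score 1 − 3/x, drawn independently) so that the scores are known exactly. The distributions, parameter values, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_binary_choice(n=200_000, seed=0):
    """Draw (X, scores, y) from a binary choice model with non-normal X."""
    rng = np.random.default_rng(seed)
    x1 = rng.logistic(size=n)              # standard logistic component
    x2 = rng.gamma(shape=4.0, size=n)      # gamma(4, 1): density vanishes at the boundary 0
    X = np.column_stack([x1, x2])
    # Known scores l(X) = -d ln p(X)/dX for the two independent components
    scores = np.column_stack([np.tanh(x1 / 2.0), 1.0 - 3.0 / x2])
    alpha, beta = -2.0, np.array([1.0, 0.5])   # hypothetical true values
    eps = rng.standard_normal(n)               # the estimators never use this distribution
    y = (eps > -(alpha + X @ beta)).astype(float)
    return X, scores, y, beta

def sample_cov(a, b):
    """Sample covariance: sum of (a_i - abar)(b_i - bbar)' divided by N."""
    a_centered = a - a.mean(axis=0)
    b_centered = b - b.mean(axis=0)
    return a_centered.T @ b_centered / len(a)

def covariance_estimator(scores, y):
    """d_0 = S_ly, the sample covariance between the scores and y."""
    return sample_cov(scores, y)

def iv_estimator(X, scores, y):
    """d = (S_lX)^{-1} S_ly, the IV slope using (1, score) as instruments."""
    return np.linalg.solve(sample_cov(scores, X), sample_cov(scores, y))

X, scores, y, beta = simulate_binary_choice()
d0 = covariance_estimator(scores, y)
d = iv_estimator(X, scores, y)
print("true ratio beta_2/beta_1:", beta[1] / beta[0])
print("ratio from d_0:", d0[1] / d0[0])
print("ratio from d:  ", d[1] / d[0])
```

Only ratios of components are compared with β, since each estimator targets γβ rather than β itself.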

The connection to the aggregate effects of translation permits a further
interpretation of the scale factor γ = E(dF/dZ). The structure of the single
index model (2.1) implies that the local aggregate effects of translation are
proportional to β, the parameters of interest. In particular, insert (4.1)
into (3.8b), giving

(4.7)      \frac{\partial E(y \mid 0)}{\partial \theta} = E\left[\frac{\partial F}{\partial X}\right] = E\left[\frac{dF}{dZ}\right] \beta = \gamma\beta

This appearance of β is due to the correspondence between density translation
and the linear form of the index Z = α + X'β. To interpret γ, note that under
translation, the marginal distribution of Z is shifted by the parameter η = θ'β,
with the mean of Z increased by η. (4.7) can be regarded as the chain rule
formula ∂E(y)/∂θ = (dE(y)/dη)(∂η/∂θ), where ∂η/∂θ is equal to β. The scale
factor γ is equal to dE(y)/dη, the effect on E(y) induced by a change in the
mean E(Z) of the index variable Z.
4.2 Extraneous Variables and Multiple Index Models

The approach of parameter estimation via average derivatives easily
extends to more general models than those relying on a single index. In this
section I consider some immediate extensions, namely to models with extraneous
variables and multiple index models.

Begin by expanding the notation to consider two sets of explanatory
variables; an M_1 vector X_1 and an M_2 vector X_2, distributed with density
p(X_1,X_2). Consider first the case where X_2 are extraneous variables, in that
the behavioral model for y implies

(4.8)      E(y \mid X_1, X_2) = F(\alpha_1 + X_1'\beta_1,\, X_2) = F(Z_1, X_2)

for some function F with constant coefficients α_1, β_1, and Z_1 = α_1 + X_1'β_1. In
this case, β_1 is proportional to the (partial) derivative of F with respect to
X_1, as in

(4.9)      \frac{\partial E(y \mid X_1, X_2)}{\partial X_1} = \frac{\partial F}{\partial X_1} = \left[\frac{\partial F}{\partial Z_1}\right] \beta_1

so that the average derivative is proportional to β_1:

(4.10)     E\left[\frac{\partial F}{\partial X_1}\right] = E\left[\frac{\partial F}{\partial Z_1}\right] \beta_1 = \gamma_1 \beta_1

Theorems 1-3 can be applied as long as the appropriate analogues of
Assumptions 1-4 applied to X_1 are valid. In particular, the proof of Theorem 1
will apply to individual components of X_1 provided that no two components of
X_1, X_2 are perfectly correlated, and that the conditional density p_1(X_1|X_2)
vanishes on the boundary of X_1 values for each value of X_2. Under these
conditions, the sample covariance d_01 = S_ℓ₁y consistently estimates γ_1β_1, where
the partial score ℓ_1i = ℓ_1(X_1i,X_2i) is defined via

(4.11)     \ell_1(X_1, X_2) = -\frac{\partial \ln p(X_1, X_2)}{\partial X_1} = -\frac{\partial \ln p_1(X_1 \mid X_2)}{\partial X_1}

Moreover, γ_1β_1 is consistently estimated by the slope coefficient estimates
d_1 of the linear equation

(4.12)     y_i = c_1 + X_{1i}' d_1 + u_{1i}

obtained by instrumenting with (1, ℓ_1i')'. Thus, the extraneous variables X_2
are accommodated in the estimation of β_1 by modification of the appropriate
instrumental variables, to reflect the joint distribution of X_1 and X_2.
Clearly if X_1 were distributed independently of X_2, then X_2 can be ignored in
the estimation of β_1 up to scale.

This extension provides an initial response as to how to accommodate
discrete explanatory variables into the analysis. If X_2 is composed of
discrete variables, an approach based on average derivatives is not obviously
applicable to estimating effects of X_2. However, the coefficients of the
remaining continuous variables X_1 can be estimated up to scale by using the
score vectors of the conditional density of X_1 given the observed values of X_2
as instrumental variables. Consequently, while the analysis is silent on how
to estimate coefficients of discrete variables, their presence does not
prohibit the estimation of continuous variable coefficients up to scale.
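The use of a conditional score with a discrete extraneous variable can be sketched as follows. The example assumes a binary X_2 and a continuous X_1 that is normal with identity covariance and a mean depending on X_2, so that the partial score is X_1 minus its conditional mean. All distributions, parameter values, and function names are hypothetical.

```python
import numpy as np

def simulate_extraneous(n=200_000, seed=0):
    """Binary choice with continuous X1 and a discrete extraneous variable X2."""
    rng = np.random.default_rng(seed)
    x2 = rng.integers(0, 2, size=n)                   # discrete extraneous variable
    cond_means = np.array([[0.0, 0.0], [1.0, -1.0]])  # E(X1 | X2 = 0) and E(X1 | X2 = 1)
    x1 = cond_means[x2] + rng.standard_normal((n, 2)) # X1 | X2 ~ N(m(X2), I)
    partial_score = x1 - cond_means[x2]               # -d ln p1(X1 | X2) / dX1
    beta1 = np.array([1.0, -0.5])                     # hypothetical true values
    eps = rng.logistic(size=n)
    # X2 enters the model directly, outside the index in X1
    y = (eps > -(-0.5 + x1 @ beta1 + 1.5 * x2)).astype(float)
    return x1, partial_score, y, beta1

def iv_slopes(x, instruments, y):
    """IV slope coefficients of y on x, using (1, instruments) as instruments."""
    z_centered = instruments - instruments.mean(axis=0)
    s_zx = z_centered.T @ (x - x.mean(axis=0)) / len(y)
    s_zy = z_centered.T @ (y - y.mean()) / len(y)
    return np.linalg.solve(s_zx, s_zy)

x1, partial_score, y, beta1 = simulate_extraneous()
d1 = iv_slopes(x1, partial_score, y)
print("true ratio:", beta1[1] / beta1[0], " estimated ratio:", d1[1] / d1[0])
```

The coefficient on X_2 is never estimated here; X_2 matters only through the conditional density that defines the instruments.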

Putting aside this proviso on discrete variables, I now turn to multiple
index models. All relevant points are exhibited by two index models, so assume
that X = (X_1',X_2')' is composed entirely of continuous variables with M_1 ≥ 2 and
M_2 ≥ 2. Suppose that the behavioral model implies the following two index form

(4.13)     E(y \mid X) = F(\alpha_1 + X_1'\beta_1,\; \alpha_2 + X_2'\beta_2) = F(Z_1, Z_2)

where Z_1 = α_1 + X_1'β_1 and Z_2 = α_2 + X_2'β_2 represent the two index variables. The
derivative of the conditional expectation now takes the form

(4.14)     \frac{\partial E(y \mid X)}{\partial X} = \frac{\partial F}{\partial X}
             = \begin{bmatrix} \left[\frac{\partial F}{\partial Z_1}\right] \beta_1 \\ \left[\frac{\partial F}{\partial Z_2}\right] \beta_2 \end{bmatrix}

so that the average derivative is

(4.15)     E\left[\frac{\partial F}{\partial X}\right] = \begin{bmatrix} \gamma_1 \beta_1 \\ \gamma_2 \beta_2 \end{bmatrix}

where γ_1 = E(∂F/∂Z_1) and γ_2 = E(∂F/∂Z_2) are scalar constants. Thus a consistent
estimate of the average derivative will estimate β_1 and β_2 up to scale;
however, the scale factors γ_1 and γ_2 will differ in general.

Such a consistent estimate has already been established, provided that
(y,F) of (4.13) obeys condition A. Namely, the estimator d of (4.5) consistently
estimates E(∂F/∂X), so that its components corresponding to X_1 estimate β_1 up
to scale, and its components corresponding to X_2 consistently estimate β_2 up
to scale.¹² The main modeling limitation of this result is that no two
components of X_1 and X_2 may be functionally related or perfectly correlated.
Thus the index variables Z_1 and Z_2 may have no common component variables, an
exclusion restriction that is required for estimating both β_1 and β_2 up to
scale using average first derivatives.¹³ The following example gives a two
index model, where γ_1 = 1 a priori.


Example 4: Selection Bias

Suppose that the basic behavioral model is y = α_1 + X_1'β_1 + ε_1, but that y, X_1 and
X_2 are observed only if Z_2* = α_2 + X_2'β_2 + ε_2 > 0, where (ε_1,ε_2) is distributed
independently of (X_1,X_2). This implies that

E(y|X_1,X_2,Z_2* > 0) = α_1 + X_1'β_1 + E(ε_1 | ε_2 > −(α_2 + X_2'β_2))

so that d will estimate the structural parameters β_1 and the selection
parameters β_2 up to scale, without explicit assumptions on the joint
distribution of (ε_1,ε_2). Note that ∂F/∂Z_1 = 1, so that γ_1 = 1. Thus, the
components of d corresponding to X_1 will consistently estimate β_1, with no
proviso about scale.


By comparing Example 4 and the truncated tobit specification of Example 2,
there are two polar cases where selection parameters are estimated up to scale
by d, namely when the selection index Z_2 has no variables in common with the
structural index Z_1, or when the selection index Z_2 is equal to the structural
index Z_1.

I close with another example, that further illustrates how exclusion and
other parameter restrictions bear on the estimation of specific coefficients
up to scale.


Example 5: Discrete Choice Among Several Alternatives

Suppose that one is studying the choice between j = 1,...,J discrete
alternatives. The attractiveness (utility) of the j-th alternative is modeled
as

V_j = α_j + X_1j'β_1j + X_2'δ_j + ε_j

where X_1j is a set of option specific explanatory variables, with
X_1 = (X_11,...,X_1J) containing no two components that are perfectly correlated.
X_2 represents explanatory variables that are observation specific, but bear on
the attractiveness of each option. ε_j is a random term, such that (ε_1,...,ε_J)
is distributed independently of (X_1,X_2).¹⁴ The parameters α_j, β_1j, δ_j may vary
with option j.

Focus on the J-th alternative, and assume that y = 1 if J is chosen and y = 0
if another alternative is chosen. Define J−1 index variables

Z_j = (α_J − α_j) + X_1J'β_1J − X_1j'β_1j + X_2'β_2j,      j = 1,...,J−1

where β_2j = δ_J − δ_j. Now J is chosen, or y = 1, when V_j < V_J for j = 1,...,J−1. This
occurs when

ε_j − ε_J < Z_j      for all j = 1,...,J−1

E(y|X_1,X_2) is just the probability of the above event given X_1,X_2; or

E(y|X_1,X_2) = F(Z_1,...,Z_{J−1})

where F is the cumulative distribution function of (ε_1−ε_J,...,ε_{J−1}−ε_J). For
instance, if (ε_1−ε_J,...,ε_{J−1}−ε_J) is multivariate normally distributed, F is
the multivariate normal distribution function (and this is a multinomial
probit model).

Now, what is estimated by d = (d_11',...,d_1J',d_2')', partitioned according
to (X_1,X_2) = (X_11,...,X_1J,X_2)? The coefficients d_1J of the J-th specific
attributes X_1J will consistently estimate

E\left[\frac{\partial F}{\partial X_{1J}}\right] = E\left[\sum_{j=1}^{J-1} \frac{\partial F}{\partial Z_j}\right] \beta_{1J} = \gamma_J \beta_{1J}

so that d_1J consistently estimates β_1J up to scale. For j ≠ J, the coefficients
d_1j of X_1j will estimate

E\left[\frac{\partial F}{\partial X_{1j}}\right] = -E\left[\frac{\partial F}{\partial Z_j}\right] \beta_{1j} = \gamma_j \beta_{1j}

so that d_1j consistently estimates β_1j up to scale. Finally, the coefficients
d_2 of X_2 will consistently estimate

E\left[\frac{\partial F}{\partial X_2}\right] = \sum_{j=1}^{J-1} E\left[\frac{\partial F}{\partial Z_j}\right] \beta_{2j}

so that d_2 estimates a linear combination of the parameters β_2j, j = 1,...,J−1.

Consequently, the respective components of d will consistently estimate
the option specific parameter vectors β_1j, j = 1,...,J, up to scale values. This
occurs because X_1J appears in each index with the same coefficients β_1J, and
for j ≠ J, X_1j only appears in the single index Z_j. β_2j, j = 1,...,J−1, are not
separately estimated because X_2 appears with a different coefficient value in
each index Z_j.
4.3 Further Remarks

4.3a A Note on Heteroscedastic Disturbances

As indicated in Section 2, the development applies to models where
y = f(α + X'β + ε), where ε is distributed independently of X. Often it is desirable
to estimate β in situations where ε is heteroscedastic, with the distribution
of ε depending on X. Estimators that are robust to heteroscedasticity of ε for
specific models are given in Manski's (1975, 1985) work on maximum score
estimation and Powell's (1984) work on censored least absolute deviations
estimation.

It  is  easy  to  see  in  general  that  d  will  not  estimate  p  up  to  scale  when 
the  distribution  of  £  depends  on  X.  In  this  case,  the  conditional  expectation 
E(y|X)  will  not  be  a  function  of  a+X'p  alone,  depending  in  general  on  how  X 

alters  the  distribution  of  £  over  observations.  Equations  (2.1)  and  (4.1)  will 

15 
not  hold,  which  breaks  the  relation  between  p  and  the  average  derivative. 

The  one  special  case  where  heteroscedasticity  of  €  does  not  alter  the 

consistency  of  d  for  p  up  to  scale  is  when  the  distribution  of  £  depends  only 

on  the  value  of  the  index  Z=a+X'p.  Here  (2.1)  and  (4.1)  are  valid,  with  the 

development  applying  without  modification.  While  this  is  a  strong  restriction. 
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some  models  obey  this  restriction,  such  as  truncated  Poisson  regression 
models.  A  good  survey  of  this  and  related  models  is  given  in  Manski ( 1984a) . 

4.3b:  The  Statistical  Role  of  the  X  Distribution 

An interesting feature of the estimators suggested by Theorem 1 is their
explicit dependence on the density $p(X)$ of the marginal distribution of X. In
particular, the consistency of the estimators relies on the fact that the data
$X_i$, $i=1,\dots,N$, represents a random sample, so that the X's are not taken as
ancillary for estimation.

This raises a rather deep statistical issue concerning the efficiency of
the estimators, which is described as follows. The overall object of
estimation is to measure the value of $\beta$ up to scale. $\beta$ is clearly a
parametric feature of the conditional distribution of y given X, and so there is no
generic necessity for knowing the marginal distribution of X. The usefulness
of the information provided by the density $p(X)$ is more surprising than
natural, when viewed in this light.

The role of the density $p(X)$ is built into the particular estimation
strategy employed here, namely the estimation of the average derivative
$E(\partial F/\partial X)$ of (4.2). The value of $E(\partial F/\partial X)$ clearly
depends on the true marginal density $p(X)$ - altering the marginal density will
alter the average derivative. In other words, even if the exact form of the
conditional density of y given X were known for all X values, the average
$E(\partial F/\partial X)$ could not be consistently estimated without reference to
the configuration of the X values in a large sample.
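The dependence of the average derivative on the marginal density can be seen in a small numerical sketch. It is an illustration only: the logistic form of F, the parameter values, the two normal designs for X, and the helper name `average_derivative` are all assumptions made for the example. The conditional model is held fixed while the X distribution is shifted.

```python
import numpy as np

def average_derivative(X, alpha, beta):
    """Sample analog of E(dF/dX) = E[F'(Z)] * beta, Z = alpha + X'beta, logistic F."""
    Z = alpha + X @ beta
    F = 1.0 / (1.0 + np.exp(-Z))
    return np.mean(F * (1.0 - F)) * beta   # for the logistic cdf, F'(Z) = F(1 - F)

rng = np.random.default_rng(4)
N = 100_000
alpha, beta = 0.0, np.array([1.0, 0.5])    # hypothetical parameter values

X_centered = rng.normal(loc=0.0, size=(N, 2))   # X ~ N(0, I)
X_shifted = rng.normal(loc=2.0, size=(N, 2))    # X ~ N((2, 2)', I)

ad_centered = average_derivative(X_centered, alpha, beta)
ad_shifted = average_derivative(X_shifted, alpha, beta)
print(ad_centered, ad_shifted)
```

In both designs the ratio of the two components equals the ratio of the components of $\beta$; only the scale factor $E(dF/dZ)$ changes with the X distribution.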

But estimation of the average derivative does not represent the only
conceivable method of estimating $\beta$ up to scale. This can be seen from equation
(4.1), which is a derivative constraint on the conditional expectation of y
given X, and does not involve the density of X. This suggests that more
efficient estimators of $\beta$ up to scale could be found, which take X as
ancillary (i.e. which condition on the observed data values $X_i$, $i=1,\dots,N$).
No such general, more efficient estimators are known to the author; however,
the possibility of their existence warrants investigation via future
research.


5.  Biases  in  Ordinary  Least  Squares  Coefficients 

The linear structure of d suggests a natural comparison with the OLS
slope coefficient vector $b=(S_{XX})^{-1}S_{Xy}$ from the regression of $y_i$ on
$X_i$, $i=1,\dots,N$.[17] This section uses the above development to study the
asymptotic bias in b as an estimator of the true average derivative
$E(\partial F/\partial X)$, and as an estimator of $\beta$ up to scale. The primary
focus is on the role of the distribution of X, as the formulae below are applicable
regardless of the true form of the function F of (2.1).

Begin by considering the circumstances under which $\ell(X)$ is collinear with
X. If so, then $b=d$, which gives a robust interpretation of b as an estimator
of $\beta$ up to scale, or alternately a case where d is particularly easy to
compute. Now suppose that $\ell(X)=A+BX$, where A is an M vector and B an $M \times M$
matrix of constants, and denote $\mu_X=E(X)$. Since $E(\ell(X))=0$,
$A=-BE(X)=-B\mu_X$, so that $\ell(X)=B(X-\mu_X)$. By integrating $\ell(X)$,
$\ln p(X)$ can be written in the form
$\ln p(X)=C-(1/2)(X-\mu_X)'B(X-\mu_X)$ for some constant C. Thus $p(X)$ must be of
the multivariate normal form, with $B=(\Sigma_{XX})^{-1}$. Consequently b and d
coincide only when X is multivariate normally distributed.
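The normal case can be checked numerically. The following sketch is not part of the original development: the binary response model, the parameter values, and the helper names `sample_cov` and `normal_score` are assumptions made for the example. It fits a normal density to simulated regressors, forms the score $\ell(X)=\hat\Sigma_{XX}^{-1}(X-\bar X)$, and compares the instrumental variables coefficients $d=(S_{\ell X})^{-1}S_{\ell y}$ with the OLS slopes b.

```python
import numpy as np

def sample_cov(A, B):
    """Sample cross-covariance matrix between the columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return A.T @ B / len(A)

def normal_score(X):
    """Score l(X) = Sigma^{-1}(X - mu) of a normal density fitted by sample moments."""
    mu = X.mean(axis=0)
    Sigma = sample_cov(X, X)
    return np.linalg.solve(Sigma, (X - mu).T).T

rng = np.random.default_rng(0)
N = 5000
beta = np.array([1.0, -0.5, 2.0])          # hypothetical coefficients
Sigma_true = np.array([[1.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma_true, size=N)
# Binary response with a logistic disturbance (any single index model would do).
y = (X @ beta + rng.logistic(size=N) > 0).astype(float)
Y = y[:, None]

b = np.linalg.solve(sample_cov(X, X), sample_cov(X, Y)).ravel()   # OLS slopes
L = normal_score(X)
d = np.linalg.solve(sample_cov(L, X), sample_cov(L, Y)).ravel()   # IV slopes
print(b, d)
```

With the fitted normal score the two coefficient vectors agree up to floating point error, since the factor $\hat\Sigma_{XX}^{-1}$ cancels in the instrumental variables formula.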

Theorem 1 appears in simple form in this case. $\ell(X)=(\Sigma_{XX})^{-1}(X-\mu_X)$,
so that $E(\ell(X)y)=Cov(\ell(X),y)=(\Sigma_{XX})^{-1}\Sigma_{Xy}$. This is clearly
the a.s. limit of the OLS coefficient vector, namely
$b=\lim \hat b=(\Sigma_{XX})^{-1}\Sigma_{Xy}$.[18]


To study the asymptotic bias of b when X is not normally distributed,
first consider the difference between the average derivative
$E(\partial F/\partial X)$ and b. Since $E(\partial F/\partial X)=E(\ell(X)y)$ by
Theorem 1,

(5.1)  $E(\partial F/\partial X) - b = E([\ell(X) - (\Sigma_{XX})^{-1}(X-\mu_X)]y) = E(R(X)y) = Cov(R(X),y)$

where $R(X) \equiv \ell(X)-(\Sigma_{XX})^{-1}(X-\mu_X)$, and the last equality follows
from $E(R(X))=0$.

Notice that $R(X)$ can be regarded as a large sample OLS residual vector.
The OLS coefficients $\hat B$ of the multivariate regression equation
$\ell(X_i)=B(X_i-\mu_X)+R(X_i)$ are such that
$\lim \hat B=(\Sigma_{XX})^{-1}\Sigma_{X\ell}=(\Sigma_{XX})^{-1}$ a.s., since
$\Sigma_{X\ell}=I$ by the proof of Theorem 2. Thus $\lim \hat R(X_i)=R(X_i)$ a.s.,
with $R(X)$ interpreted as the large sample least squares departure of $\ell(X)$ from
X. In particular, $R(X)=0$ for all X only if X is normally distributed.

Equation (5.1) says that b consistently estimates the average derivative
$E(\partial F/\partial X)$ only when y is uncorrelated with the least squares
difference $R(X)$ between $\ell(X)$ and X. Thus, unless X is normally distributed, b
will consistently estimate the average derivative $E(\partial F/\partial X)$ only in
certain model-specific cases. To consider the role of the true model in this
property, refine equation (5.1) as follows. Since $R(X)$ is a least squares residual,
$R(X)$ is uncorrelated with X. Therefore y can be replaced in (5.1) by the large
sample residual $\xi=(y-\mu_y)-(X-\mu_X)'b$ from the OLS regression of y on X, as in

(5.2)  $E(\partial F/\partial X) - b = Cov(R(X),y) = Cov(R(X),\xi)$

There are two natural polar cases under which b will estimate the average
derivative $E(\partial F/\partial X)$; first when $R(X)=0$, or when X is normally
distributed, and second when $E(\xi|X)=0$, or when the true model between y and X is
a linear regression model.[19] In nonlinear cases, b will estimate
$E(\partial F/\partial X)$ only if the specific functional form assures that the OLS
residual $\xi$ is uncorrelated with $R(X)$. Finally, note that (5.1,2) do not utilize
the single index form (2.1), so that y,F could be replaced in the above discussion
by any y,G(X) obeying condition A.
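The sample counterparts of (5.1) and (5.2) can be examined in a short simulation. This is a sketch under assumptions that are not in the text: the regressors are independent standard logistic variables (whose score is $\ell(x)=\tanh(x/2)$), the response is a censored regression chosen only as an example of a nonlinear model, and sample moments replace $\Sigma_{XX}$ and $\mu_X$ in $R(X)$.

```python
import numpy as np

def sample_cov(A, B):
    """Sample cross-covariance matrix between the columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return A.T @ B / len(A)

rng = np.random.default_rng(1)
N = 100_000
beta = np.array([1.0, 2.0])                 # hypothetical coefficients

# Independent standard logistic regressors: l(x) = -d ln p(x)/dx = tanh(x / 2).
X = rng.logistic(size=(N, 2))
L = np.tanh(X / 2.0)

# Censored response, used only as an example of a nonlinear model.
y = np.maximum(0.0, 0.5 + X @ beta + rng.normal(size=N))
Y = y[:, None]

S_xx = sample_cov(X, X)
Xc = X - X.mean(axis=0)
b = np.linalg.solve(S_xx, sample_cov(X, Y))          # OLS slopes, shape (M, 1)
R = L - np.linalg.solve(S_xx, Xc.T).T                # sample analog of R(X)
xi = (Y - Y.mean()) - Xc @ b                         # OLS residual

gap_direct = (sample_cov(L, Y) - b).ravel()   # Cov(l, y) - b
gap_R_y = sample_cov(R, Y).ravel()            # Cov(R, y), as in (5.1)
gap_R_xi = sample_cov(R, xi).ravel()          # Cov(R, xi), as in (5.2)
print(gap_direct, gap_R_y, gap_R_xi)
```

The first two quantities agree exactly in the sample. The third agrees only approximately, because the sample covariance between $R(X)$ and X vanishes in the limit rather than in every sample.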
At face value, this suggests that the conditions under which b estimates $\beta$
up to scale may also be restricted to X normally distributed. However,
Ruud (1983, 1984), Chung and Goldberger (1984) and Deaton and Irish (1984) have
pointed out another sufficient condition on the distribution of X that does
not restrict X to be normally distributed. In particular, Ruud (1983a) shows
that b will consistently estimate $\beta$ up to scale when $E(X|Z)=G+HZ$, for
$Z=\alpha+X'\beta$, or that $E(X|Z)$ is a linear function of Z. Chung and
Goldberger (1984) and Deaton and Irish (1984) find the same result using an
analogous condition with a generalized definition of Z. Chamberlain (1983) has
pointed out that this condition is obeyed when the distribution of X is
(elliptically) symmetric, but not necessarily normal (see also Dempster (1969)).
Consequently, equations (5.1,2) do not suffice to characterize the asymptotic bias
of b as an estimator of $\beta$ up to scale.

A bit more development yields a bias formula that explicitly displays the
role of the linearity condition. Assume first that the relationship between y
and X can be represented by $y=f(\alpha+X'\beta+\epsilon)=f(Z+\epsilon)$ for some
(unknown) function f, where $\epsilon$ is distributed independently of X. Denote the
marginal distribution of $Z=\alpha+X'\beta$ as $p_Z(Z)$, the mean of Z as
$\mu_Z=E(Z)$ and the associated log-density derivative as
$\ell_Z(Z)=-d \ln p_Z(Z)/dZ$. Define the large sample residual of $\ell_Z(Z)$
regressed on Z as

(5.3)  $r_\ell(Z) = \ell_Z(Z) - \sigma_Z^{-2}(Z-\mu_Z)$

Clearly $r_\ell(Z)=0$ for all Z if and only if Z is normally distributed. Finally,
define the large sample residual vector of $E(X|Z)$ regressed on Z as

(5.4)  $r_X(Z) = (E(X|Z)-\mu_X) - H(Z-\mu_Z)$

where $H=\Sigma_{XZ}/\sigma_Z^2$. Clearly $r_X(Z)=0$ for all Z if and only if the
linearity condition holds, or that $E(X|Z)=G+HZ$. The formula to be derived relates
the asymptotic bias in b to covariances between y and the residuals $r_\ell(Z)$ and
$r_X(Z)$.

Recall that for the single index model (2.1), we have that
$E(\partial F/\partial X)=\gamma\beta$, where
$\gamma=E(dF/dZ)=\int (dF/dZ)\,p_Z(Z)\,d\nu$. By applying Theorem 1, $\gamma$ can be
written as $\gamma=E(\ell_Z(Z)F(Z))=Cov(\ell_Z(Z),y)$. Thus, the average derivative
$E(\partial F/\partial X)$ can be written as

(5.5)  $E(\partial F/\partial X) = \beta\, Cov(\ell_Z(Z),y)$



To characterize the limit $b=(\Sigma_{XX})^{-1}\Sigma_{Xy}$ of $\hat b$, note first
that $\Sigma_{Xy}=E[(E(X|Z)-\mu_X)y]$. This is valid because at a given value of Z,
the conditional covariance between $X-E(X|Z)$ and $y=f(Z+\epsilon)$ is zero. Now

(5.6)  $b = (\Sigma_{XX})^{-1}E[(E(X|Z)-\mu_X)y]$
         $= E[(\Sigma_{XX})^{-1}H(Z-\mu_Z)y] + (\Sigma_{XX})^{-1}E(r_X(Z)y)$

by using (5.4). Note that by construction, $\beta=(\Sigma_{XX})^{-1}\Sigma_{XZ}$, so
that $(\Sigma_{XX})^{-1}H=(\Sigma_{XX})^{-1}\Sigma_{XZ}/\sigma_Z^2=\beta/\sigma_Z^2$.
Inserting this gives

(5.7)  $b = \beta\, E(\sigma_Z^{-2}(Z-\mu_Z)y) + (\Sigma_{XX})^{-1}E(r_X(Z)y)$
         $= \beta\, Cov(\sigma_Z^{-2}(Z-\mu_Z),y) + (\Sigma_{XX})^{-1}Cov(r_X(Z),y)$

The desired bias formula is obtained by combining (5.5), (5.7) and (5.3) to
yield

(5.8)  $E(\partial F/\partial X) - b = \beta\, Cov(r_\ell(Z),y) - (\Sigma_{XX})^{-1}Cov(r_X(Z),y)$

Equation (5.8) provides a categorization of the asymptotic bias in the
OLS estimator b vis-a-vis the X distribution underlying the data. When X is
multivariate normally distributed, Z is also normal, $r_\ell(Z)=0$ and $r_X(Z)=0$ for
all Z, and b consistently estimates the average derivative
$E(\partial F/\partial X)=\gamma\beta$. When X is not normally distributed but the
linearity condition holds, $r_X(Z)=0$, and b consistently estimates
$(\gamma-Cov(r_\ell(Z),y))\beta$. The covariance term will not be zero in general,
but the asymptotic bias $E(\partial F/\partial X)-b$ is proportional to $\beta$, so
that b still estimates $\beta$ up to scale. Finally, b will not consistently estimate
$\beta$ up to scale in general if $r_X(Z) \neq 0$.

Thus, for a model-free interpretation of b as the average derivative
$E(\partial F/\partial X)$, multivariate normality of the X distribution is
essential. For the question of when b estimates $\beta$ up to scale, the linearity
condition $E(X|Z)=G+HZ$ provides a solution that is not directly related to
estimation of the average derivative.[21]
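The distinction between the two properties can be illustrated with an elliptically symmetric but non-normal design. The sketch below rests on assumptions made for the example only: X follows a multivariate t distribution with identity scale matrix, whose score is $\ell(X)=\omega X$ with $\omega=(\nu+M)/(\nu+X'X)$, and the response is a probit-type binary variable, so that $dF/dZ$ is the standard normal density. The estimator d is compared with the sample average derivative, and the OLS slopes b are checked only for proportionality to $\beta$.

```python
import numpy as np

def sample_cov(A, B):
    """Sample cross-covariance matrix between the columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return A.T @ B / len(A)

rng = np.random.default_rng(2)
N, nu = 200_000, 8.0                        # sample size, t degrees of freedom
beta = np.array([1.0, 2.0])                 # hypothetical coefficients
M = beta.size

# Multivariate t regressors with identity scale matrix (elliptically symmetric).
G = rng.normal(size=(N, M))
W = rng.chisquare(nu, size=N)
X = G * np.sqrt(nu / W)[:, None]

# Probit-type response: E(y|X) = Phi(Z) with Z = X'beta, so dF/dZ = phi(Z).
Z = X @ beta
y = (Z + rng.normal(size=N) > 0).astype(float)
Y = y[:, None]

# Score of the t density: l(X) = w X with w = (nu + M) / (nu + X'X).
q = np.sum(X**2, axis=1)
L = ((nu + M) / (nu + q))[:, None] * X

phi = np.exp(-0.5 * Z**2) / np.sqrt(2.0 * np.pi)
avg_deriv = phi.mean() * beta               # sample analog of gamma * beta
b = np.linalg.solve(sample_cov(X, X), sample_cov(X, Y)).ravel()   # OLS slopes
d = np.linalg.solve(sample_cov(L, X), sample_cov(L, Y)).ravel()   # IV slopes
print(avg_deriv, d, b)
```

In this design d should track the average derivative itself, whereas b is only expected to be proportional to $\beta$, with a scale factor that need not equal $\gamma$.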


6.  Distribution  Theory  with  a  Parametric  Density  Form 

The above exposition has proposed the estimator d as an estimator of $\beta$ up
to scale, that explicitly utilizes information on the marginal density $p(X)$.
When $p(X)$ is in the multivariate normal form, d can be computed as the OLS
slope coefficients, and scale-free inferences on the value of $\beta$ (as discussed
below) can be performed with standard methods.[22] In general, a statistical
characterization of the density $p(X)$ will be required to implement d. This
section establishes the asymptotic distribution of d when the density is
modeled via a finite parameterization.

Suppose that $p(X)=p(X|\Lambda_0)$, where $p(X|\Lambda)$ denotes a parametric family,
with $\Lambda$ an L-vector of parameters that characterize the location and shape of
$p(X)$; means, variances, skewness, etc. The density score vector is determined
by $\Lambda_0$ as $\ell(X)=\ell(X|\Lambda_0)$. Assumptions 5 and 6 of the appendix
assume that a $\sqrt{N}$-consistent estimator $\hat\Lambda$ of $\Lambda=\Lambda_0$
can be computed using the data $X_i$, $i=1,\dots,N$, as well as some regularity
conditions.

Estimation now proceeds in two steps. First estimate $\Lambda_0$ using
$\hat\Lambda$. Next compute $d^*$ as the instrumental variables estimator using the
estimated instrument

(6.1)  $\hat\ell_i = \ell(X_i|\hat\Lambda) = -\partial \ln p(X_i|\hat\Lambda)/\partial X$

as in

(6.2)  $d^* = (S_{\hat\ell X})^{-1}S_{\hat\ell y}$
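The two steps can be sketched in code for one particular parametric family. Everything specific here is an assumption made for illustration: the regressors are modeled as independent logistic variables with unknown locations $m_j$ and scales $s_j$, these parameters are estimated by sample moments, and the response is a probit-type binary variable so that the target $\gamma\beta$ can be computed directly. The helper names are hypothetical.

```python
import numpy as np

def sample_cov(A, B):
    """Sample cross-covariance matrix between the columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return A.T @ B / len(A)

def fit_logistic_margins(X):
    """Step 1: moment estimates of the location m_j and scale s_j of each column."""
    m = X.mean(axis=0)
    s = X.std(axis=0) * np.sqrt(3.0) / np.pi     # logistic variance is s^2 pi^2 / 3
    return m, s

def logistic_score(X, m, s):
    """l(X | Lambda) = -d ln p / dX for independent logistic margins."""
    return np.tanh((X - m) / (2.0 * s)) / s

def two_step_iv(X, y):
    """Step 2: d* = (S_lX)^{-1} S_ly, using the estimated score as instrument."""
    m, s = fit_logistic_margins(X)
    L_hat = logistic_score(X, m, s)
    return np.linalg.solve(sample_cov(L_hat, X), sample_cov(L_hat, y[:, None])).ravel()

rng = np.random.default_rng(3)
N = 200_000
beta = np.array([1.0, 2.0])                      # hypothetical coefficients
X = rng.logistic(loc=[1.0, -2.0], scale=[1.0, 0.5], size=(N, 2))
Z = 3.0 + X @ beta
y = (Z + rng.normal(size=N) > 0).astype(float)   # E(y|X) = Phi(Z)

d_star = two_step_iv(X, y)
gamma_beta = (np.exp(-0.5 * Z**2) / np.sqrt(2.0 * np.pi)).mean() * beta
print(d_star, gamma_beta)
```

Only sample means and covariances are needed once the score is available, so the second step is no more costly than an ordinary instrumental variables regression.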

The strong consistency and asymptotic normality of $d^*$ is established in


Theorem 3: Given Assumptions 1-6, (a) $d^*$ is a strongly consistent estimator of
$\gamma\beta$, and (b) the limiting distribution of $\sqrt{N}(d^*-\gamma\beta)$ is
multivariate normal with mean 0 and variance-covariance matrix

(6.3)  $V = \Sigma_{\ell u,\ell u} + D\Sigma_{\ell u,\lambda}' + \Sigma_{\ell u,\lambda}D' + D\Sigma_{\lambda\lambda}D'$

where $u=(y-\mu_y)-(X-\mu_X)'\gamma\beta$, $\ell u=\ell(X)u$,
$D=E[u(\partial\ell(X|\Lambda_0)/\partial\Lambda)]$ and $\lambda$ is the component
of $\hat\Lambda$ defined in the Appendix.


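Proof: (a): Define $\ell_i(\Lambda)=\ell(X_i|\Lambda)$, so that
$\ell_i(\Lambda_0)=\ell_i$ and $\ell_i(\hat\Lambda)=\hat\ell_i$. To show consistency
of $S_{\hat\ell y}$ for $\Sigma_{\ell y}$, define
$S_{\ell y}(\Lambda)=\sum \ell_i(\Lambda)(y_i-\bar y)/N$, so that
$S_{\ell y}(\Lambda_0)=S_{\ell y}$ and $S_{\ell y}(\hat\Lambda)=S_{\hat\ell y}$. From
(A.2a) of Assumption 6, Theorem 2 of Jennrich (1969) implies that
$S_{\ell y}(\Lambda)$ converges uniformly in $\Lambda$ to
$E[\ell(X|\Lambda)(y-\mu_y)]$. Since $\lim \hat\Lambda=\Lambda_0$ a.s., by Lemma 4
of Amemiya (1973),
$\lim S_{\hat\ell y}=\lim S_{\ell y}(\hat\Lambda)=E[\ell(X|\Lambda_0)(y-\mu_y)]=\Sigma_{\ell y}$
a.s. A similar argument (using (A.2b) of Assumption 6) shows that
$\lim S_{\hat\ell X}=\Sigma_{\ell X}$ a.s., so that
$\lim d^*=(\Sigma_{\ell X})^{-1}\Sigma_{\ell y}=\gamma\beta$ a.s.

(b): Following Newey (1984), define $u_i=(y_i-\bar y)-(X_i-\bar X)'\gamma\beta$, and
write

(6.4)  $\sqrt{N}(d^*-\gamma\beta) = (S_{\hat\ell X})^{-1}[\sum \hat\ell_i u_i/\sqrt{N}]$

A Taylor series expansion of the second term gives

(6.5)  $\sqrt{N}(d^*-\gamma\beta) = (S_{\hat\ell X})^{-1}[\sum \ell_i u_i/\sqrt{N}] + (S_{\hat\ell X})^{-1}[\sum u_i(\partial\ell(X_i|\Lambda_0)/\partial\Lambda)/N]\sqrt{N}(\hat\Lambda-\Lambda_0) + o_p(1)$

The result follows from $\lim S_{\hat\ell X}=\Sigma_{\ell X}=I$, an identity matrix,
and $plim[\sum u_i(\partial\ell(X_i|\Lambda_0)/\partial\Lambda)/N]=D$ (Weak Law of
Large Numbers). QED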


Under the additional regularity conditions (A.2c-g) of Assumption 6 of the
appendix, a consistent estimator of the variance-covariance matrix V can be
constructed as follows. Define $\hat u_i=(y_i-\bar y)-(X_i-\bar X)'d^*$ as the
estimated residual from equation (4.4) using $d^*$ as coefficient estimates, define
$\hat\lambda_i=\lambda(X_i|\hat\Lambda)$ as the estimated component of $\hat\Lambda$,
and define the estimator
$\hat D=\sum \hat u_i[\partial\ell(X_i|\hat\Lambda)/\partial\Lambda]/N$ of D. It is
easy to verify that
$\hat V=\sum(\hat\ell_i\hat u_i+\hat D\hat\lambda_i)(\hat\ell_i\hat u_i+\hat D\hat\lambda_i)'/N$
is a consistent estimator of V.

Thus when the density $p(X)$ is modeled up to a finite parameterization,
inferences on the value of $\gamma\beta$ can be performed using $d^*$ and the
consistent estimate $\hat V$ of its variance-covariance matrix. Of more interest are
tests on hypotheses on the value of $\beta$ available from $d^*$; namely hypotheses
that are not affected by the true value of $\gamma$. The main class of such
scale-free hypotheses are homogeneous linear restrictions of the form
$\kappa'\beta=0$, where $\kappa$ is an M-vector of constants. This class includes
zero restrictions such as $\beta_j=0$, equality restrictions such as
$\beta_j=\beta_k$, and ratio restrictions such as $\beta_j=\kappa_0\beta_k$ for a
constant $\kappa_0$. Tests are possible by noting that $\kappa'\beta=0$ is
equivalent to $\kappa'(\gamma\beta)=0$, and that, by Theorem 3, $\kappa'd^*$ is
asymptotically normal with mean $\kappa'(\gamma\beta)$ and variance
$\kappa'V\kappa/N$. In particular, under the null hypothesis that $\kappa'\beta=0$,
the Wald statistic $N(\kappa'd^*)^2/\kappa'\hat V\kappa$ has a limiting $\chi^2(1)$
distribution.
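A test of a single scale-free restriction can be sketched as follows. The function name `wald_scale_free` and the numerical inputs are hypothetical. The statistic is scaled by N on the assumption that $\hat V$ estimates the variance of $\sqrt{N}(d^*-\gamma\beta)$, as in Theorem 3, and the $\chi^2(1)$ tail probability is computed from the complementary error function.

```python
import math
import numpy as np

def wald_scale_free(d_star, V_hat, kappa, N):
    """Wald statistic for H0: kappa'beta = 0, with its chi-square(1) p-value.

    V_hat is taken to estimate the variance of sqrt(N) (d* - gamma beta).
    """
    d_star = np.asarray(d_star, dtype=float)
    V_hat = np.asarray(V_hat, dtype=float)
    kappa = np.asarray(kappa, dtype=float)
    stat = N * float(kappa @ d_star) ** 2 / float(kappa @ V_hat @ kappa)
    p_value = math.erfc(math.sqrt(stat / 2.0))   # P(chi2(1) > stat)
    return stat, p_value

# Hypothetical inputs, for illustration only.
d_example = [0.10, 0.25]
V_example = 0.5 * np.eye(2)

# Ratio restriction beta_2 = 2 beta_1, written as kappa'beta = 0.
stat, p = wald_scale_free(d_example, V_example, kappa=[2.0, -1.0], N=1000)

# Ratio restriction beta_2 = 2.5 beta_1, which these estimates satisfy exactly.
stat0, p0 = wald_scale_free(d_example, V_example, kappa=[2.5, -1.0], N=1000)
print(stat, p, stat0, p0)
```

Because the restriction is homogeneous, the unknown scale $\gamma$ never has to be estimated: the same $\kappa$ is applied directly to $d^*$.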

Wald statistics corresponding to joint hypotheses of M' < M linear
homogeneous restrictions can likewise be formulated using $d^*$ and $\hat V$.
Moreover, if $\phi(\beta)$ is any homogeneous M'-vector function of $\beta$, tests of
$\phi(\beta)=\phi(\gamma\beta)=0$ can be formulated using the "delta method" of
Billingsley (1979) and Rao (1973).


7.  Concluding  Remarks 

This  paper  proposes  an  approach  to  parameter  estimation  based  on  average 
behavioral derivatives, and applies the approach to the estimation of $\beta$ up to
scale  in  single  index  models.  The  proposed  estimators  explicitly  utilize 
information  on  the  marginal  distribution  of  the  explanatory  variables  in  the 
model.  The  framework  is  illustrated  using  several  examples  of  limited 
dependent  variables  models,  and  extended  to  multiple  index  models.  The 
asymptotic  biases  in  OLS  coefficients  are  characterized  vis-a-vis  the 
distribution  of  explanatory  variables. 

There are two major advantages to the proposed estimator d. First, d is
nonparametric to the extent that it is robust to many specific functional form
and stochastic distribution assumptions. If a particular application requires
only estimates of the ratios of components of $\beta$, then d will suffice. In a
general application where different sets of assumptions give rise to different
estimates of $\beta$, d will provide a benchmark estimate for choosing the best
specification. Given parametric modeling of the explanatory variable
distribution, the precision of the components of d can be measured, and tests
of scale-free hypotheses on the value of $\beta$ can be performed.

The other advantage of d is computational simplicity. Once the
distribution of explanatory variables is characterized, d (as well as $d^*$) is a
linear estimator, computed entirely from sample covariances. This suggests
that implementation may be particularly easy and inexpensive, especially for
large data bases.

There  are  also  two  drawbacks,  which  suggest  natural  future  research 
topics.  First,  the  results  apply  only  to  the  estimation  of  coefficients  of 
continuous  variables,  but  most  applications  to  microeconomic  data  will  require 
using  discrete  as  well  as  continuous  explanatory  variables.  While  the  presence 
of discrete variables can be accommodated in the estimation of continuous
variable  coefficients,  the  question  of  how  to  nonparametrically  estimate 
coefficients  of  discrete  variables  remains  open. 

The second drawback involves the empirical characterization of the
explanatory variable distribution. While this distribution can in principle
always be characterized, I have only established attractive statistical
properties for d when the distribution is modeled up to a finite
parameterization, with the required score vectors computed from the estimated
distribution parameters. Of substantial practical importance is the question
of whether nonparametric estimators of the score vectors can be utilized in
the construction of d, to give a consistent estimator of $\gamma\beta$ with a
straightforward asymptotic distribution theory.[23] Thus, the results of this
paper provide further reasons for giving high priority to the application of
nonparametric techniques to econometric modeling.
Appendix: Additional Assumptions

The  following  assumptions  are  utilized  in  Section  6. 

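Assumption 5: $p(X|\Lambda)$ is twice differentiable in the components of $\Lambda$
in an open neighborhood of $\Lambda=\Lambda_0$. The estimator $\hat\Lambda$ of
$\Lambda=\Lambda_0$ is strongly consistent, and can be written in the form

(A.1)  $\hat\Lambda = \Lambda_0 + \sum \lambda(X_i|\Lambda_0)/N + o_p(1/\sqrt{N})$

where $E(\lambda(X|\Lambda_0))=E(\lambda)=0$, and the variance-covariance matrix
$E(\lambda\lambda')=\Sigma_{\lambda\lambda}$ exists.


This assumption implies that as $N \to \infty$, $\sqrt{N}(\hat\Lambda-\Lambda_0)$
has a limiting normal distribution with mean 0 and variance
$\Sigma_{\lambda\lambda}$. If $\hat\Lambda$ is a sample average; say
$\hat\Lambda=\sum g(X_i)/N$, then $\lambda=g(X)-\Lambda_0$. If $\hat\Lambda$ is the
maximum likelihood estimator, then under standard conditions
$\lambda=(I_\Lambda)^{-1}\partial \ln p(X|\Lambda_0)/\partial\Lambda$, where
$I_\Lambda=-E(\partial^2 \ln p(X|\Lambda_0)/\partial\Lambda\,\partial\Lambda')$ is the
information matrix.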

The  following  regularity  condition  is  also  used. 


Assumption 6: The covariance matrix of $\ell u$, and the covariance between any two
components of $\ell u$ and $\lambda$ exist. The mean of
$u(\partial\ell_j(X|\Lambda_0)/\partial\Lambda_e)$ exists for all $j=1,\dots,M$ and
$e=1,\dots,L$. There exist measurable functions $h^1_j(y,X)$, $h^2_{jj'}(X)$,
$h^3_{ee'}(X)$, $h^4_{ej}(y,X)$, $h^5_{ejj'}(X)$, $h^6_{je}(y,X)$ and
$h^7_{jj'e}(X)$ for all $e,e'=1,\dots,L$; $j,j'=1,\dots,M$ such that

(A.2a)  $|y\,\ell_j(X|\Lambda)| < h^1_j(y,X)$

(A.2b)  $|X_j\,\ell_{j'}(X|\Lambda)| < h^2_{jj'}(X)$

(A.2c)  $|\lambda_e(X|\Lambda)\,\lambda_{e'}(X|\Lambda)| < h^3_{ee'}(X)$

(A.2d)  $|y\,\lambda_e(X|\Lambda)\,\ell_j(X|\Lambda)| < h^4_{ej}(y,X)$

(A.2e)  $|X_j\,\lambda_e(X|\Lambda)\,\ell_{j'}(X|\Lambda)| < h^5_{ejj'}(X)$

(A.2f)  $|y\,(\partial\ell_j(X|\Lambda)/\partial\Lambda_e)| < h^6_{je}(y,X)$

(A.2g)  $|X_j\,(\partial\ell_{j'}(X|\Lambda)/\partial\Lambda_e)| < h^7_{jj'e}(X)$


for all $\Lambda$ in an open neighborhood of $\Lambda=\Lambda_0$, where $E(h^{j''})$
exists for $j''=1,\dots,7$, for all $e,e',j,j'$.
Notes


1. It is well known that coefficient estimates are sensitive to specific
stochastic distribution assumptions in many limited dependent variable
contexts. For instance, Heckman and Singer (1984) illustrate the sensitivity of
estimates for duration models, and establish an approach based on
nonparametrically estimating the stochastic heterogeneity distribution.

2. See also Brillinger (1982), Goldberger (1981), Greene (1981, 1983),
Lawley (1943), Singh and Ullah (1985) and Stewart (1983), among others.

3. This framework differs slightly from that of Chung and Goldberger (1984)
and Deaton and Irish (1984), since those papers only require $\epsilon$ (my
notation) to be uncorrelated with X.

4. The support $\Omega$ is defined as the closure of the set
$\{X \in R^M \mid p(X)>0\}$.

5. This terminology is due to the fact that $\ell(X)$ is the score vector of p
with respect to a translation parameter - see Section 3.

6. This is shown by noting that condition A is satisfied by y=G(X)=1, a
constant variable, and applying (3.3,4).

7. A similar link is used to establish the consistency of OLS estimators
for the standard linear model. Namely, the functional form assumption that
$E(y|X)=G(X)=\alpha+X'\beta$ implies $Cov(X,y)=\Sigma_{XX}\beta$, or
$Cov(\Sigma_{XX}^{-1}X,y)=\beta$. By the same assumption, the behavioral effects are
$\beta=\partial G(X)/\partial X=E(\partial G(X)/\partial X)$. The latter
correspondence underlies the practical usefulness of the standard linear
model.

8. Stoker (1986) gives a general development of local aggregate, or
macroeconomic effects.

9. Other consistent estimators of $\gamma\beta$ include the "product moment"
estimator $\bar d=\sum \ell_i y_i/N$, the "reduced form" OLS estimator of the slope
coefficients of $y_i=c+d'\tilde X_i+u_i$, where
$\tilde X_i=(\Sigma_{\ell\ell})^{-1}\ell_i$, and the weighted OLS estimator proposed
by Ruud (1984). None of these estimators are first-order equivalent to either d or
$d^*$ in general.

10.  This  expanded  notation  is  used  in  this  section  only. 

11. X then takes on the same role as the random term $\epsilon$ of section 2.

12.  Notice  that  d  of  (4.12)  consistently  estimates  Y  p  ,  and  that  the 
analogous  coefficients  from  the  linear  equation  with  X  as  explanatory 
variables  will  estimate  ^  p  . 
13. If $X_j$ appears in both $Z_1$ and $Z_2$, its coefficient in d will estimate
$\gamma_1\beta_{1j}+\gamma_2\beta_{2j}$. Thus common variables will have coefficient
estimates that are the sum of the average derivatives induced from $Z_1$ and $Z_2$.

14. The framework is subject to the "order independence" property of
Tversky (1972); see McFadden (1981). This can be relaxed without changing any of
the substantive points of this example by allowing the joint distribution of
$(\epsilon_j)$ to depend on $X_2$, but not $X_1$.
J  2  1 

15. Note that $\beta$ would be consistently estimated by d if the instrument
$\ell(X)$ were redefined as the score of the conditional distribution of X given
$\epsilon$, by treating $\epsilon$ as an extraneous variable as above. However, one
could never compute these instruments, since the value of $\epsilon$ for each data
point must be known as well as the conditional density of X given $\epsilon$.

16. Nonparametric regression function estimators could be used to estimate
$\beta$ directly from (4.1). See Stone (1977) among many others, and
Prakasa-Rao (1983) for a survey of these methods.

17. In this section it is implicitly assumed that the population
covariance matrices $\Sigma_{XX}$ and $\Sigma_{Xy}$ exist.

18. The translation interpretation of the result is also straightforward
within this context. In particular, if X is normally distributed with mean
$\mu_X$ and variance-covariance matrix $\Sigma_{XX}$, then $Z=\alpha+X'\beta$ is
normally distributed with mean $\mu_Z=\alpha+\mu_X'\beta$ and variance
$\sigma_Z^2=\beta'\Sigma_{XX}\beta$. The translated density $p(X|\theta)$ is normal
with mean $\mu_X+\theta$ and variance-covariance matrix $\Sigma_{XX}$, with the
translated density of Z normal with mean $\alpha+\mu_X'\beta+\theta'\beta$ and
variance $\sigma_Z^2$. Thus the mean of y under translation varies only with
$\theta'\beta$, so that the aggregate effects $\partial E(y|\theta)/\partial\theta$
are proportional to $\beta$.

19. $\xi$ may be heteroscedastic, so that this case includes standard
heteroscedastic linear models as well as linear models with random
coefficients, where the coefficients are distributed independently of the
included X variables.

20. The residual interpretation of $r_\ell(Z)$ is established along the same
line as the interpretation of $R(X)$ above.

21. For example, the linearity condition is valid when the distribution of
X is elliptically symmetric, as in
$p(X)=p^*[(1/2)(X-\mu_X)'\Sigma_{XX}^{-1}(X-\mu_X)]=p^*(\delta(X))$. Here
$\ell(X)=\omega(\delta(X))\Sigma_{XX}^{-1}(X-\mu_X)$, where
$\omega(\delta)=-\partial \ln p^*/\partial\delta$. When $\omega(\delta)>0$ for all
$\delta$, d is the weighted OLS coefficient estimator of (4.4), where the i-th
observation $(y_i,X_i)$ is weighted by $\sqrt{\omega(\delta(X_i))}$. Note that
$\omega(\delta(X_i))=1$ when $p(X)$ is the multivariate normal distribution.
22. The variance of d is estimated using the "heteroscedasticity
consistent" estimator of White (1980).

23. Nonparametric estimators of the score vectors can be proposed using
several methods, as surveyed by Manski (1984b) and Prakasa-Rao (1983). Gallant
and Nychka (1985) prove consistency of d when score vectors are estimated using
Hermite polynomial approximations.